This section clarifies the motivation behind the project, what we are going to investigate, and how we are going to proceed.
People around the world increasingly live in densely populated cities, and San Francisco is no exception to this trend. The people and environment surrounding us play a critical role in our happiness and overall well-being. However, very little information is available on the behavior of neighbors and the state of the neighborhood when acquiring an apartment or house.
We aim to create an interactive website that gives the residents (and newcomers) of San Francisco a tool to navigate the city on their own, investigate the state of each neighborhood, get to know the people living there, and find the neighborhood best suited to a given set of preferences.
To achieve these objectives we build on three data sets, which are explained in the following:
San Francisco 311 Cases
California House Prices
San Francisco Neighborhood Socio-Economic Profiles
After exploring the data available on DataSF, San Francisco's official open data portal, we first selected the dataset 311 Cases.
The dataset contains service requests sent to SF311, the primary customer service center for the City of San Francisco. In this dataset, we found reports of disturbances mapped to locations. This observation gave us the idea of combining the dataset with house prices to build a virtual, data-driven real estate agent.
The dataset contains service request listings, with location information, dating back to the 1st of July 2008.
The biggest factor when buying a house is undeniably the price. Therefore, we find it necessary to provide information on the prices of houses sold within each neighborhood when suggesting a suitable neighborhood to the reader.
The data set consists of houses sold in California in 2020. We filter it to contain only houses sold in San Francisco, i.e. ~10,000 houses.
To understand the neighborhoods better, we wanted to gather demographic information on each of them, e.g. population, race/ethnicity, age, etc. In this search, we came across a report on socio-economic profiles published by the San Francisco Planning Department in 2018. The report is a PDF, which we chose to manually transform into a CSV file.
The data used for the neighborhood profiles was collected over a five-year period. The profiles hold only a few fields in absolute numbers; most fields are presented as percentage shares. The statistics in each neighborhood profile come from two datasets produced by the U.S. Census Bureau, released in December 2017.
The visuals we create using the aforementioned data sets have one thing in common: user interactivity. The user should be able to explore in depth the elements that are most interesting to them.
This section explains how the analysed data was extracted and which processing techniques were applied to ensure data quality.
# install requirements
#pip install plotly==4.14.3
#pip install plotly-express==0.4.1
# import requirements
import pandas as pd
import numpy as np
from datetime import datetime as dt
import re, string
import folium
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
from shapely.geometry import Point
from shapely.geometry.polygon import Polygon
import geojson
from sklearn import preprocessing
from sorted_months_weekdays import Month_Sorted_Month, Weekday_Sorted_Week
import calendar
We created a few helper functions to make data processing more convenient.
# setup dictionary to standardise neighborhoods to the SF Planning Neighborhood standard
neighborhood_mapping = {
'Cathedral Hill' : 'Western Addition',
'Cole Valley' : 'Haight Ashbury',
'Silver Terrace' : 'Bayview',
'Tenderloin' : 'Downtown/Civic Center',
'Mission Dolores' : 'Mission',
'Lower Nob Hill' : 'Nob Hill',
'Duboce Triangle' : 'Castro/Upper Market',
'Bret Harte' : 'Bayview',
'Union Street' : 'Marina',
'Civic Center' : 'Downtown/Civic Center',
'Central Waterfront' : 'Potrero Hill',
'Hayes Valley' : 'Western Addition',
'Castro' : 'Castro/Upper Market',
'Parkmerced' : 'Lakeshore',
'Alamo Square' : 'Western Addition',
'Lower Pacific Heights' : 'Western Addition',
'Mt. Davidson Manor' : 'West of Twin Peaks',
'Corona Heights' : 'Castro/Upper Market',
'Mission Terrace' : 'Outer Mission',
'Downtown / Union Square' : 'Downtown/Civic Center',
'Showplace Square' : 'South of Market',
'Mint Hill' : 'Mission',
'Golden Gate Heights' : 'Inner Sunset',
'Portola' : 'Excelsior',
'Panhandle' : 'Haight Ashbury',
'Holly Park' : 'Bernal Heights',
'University Mound' : 'Excelsior',
'Sunnyside' : 'Outer Mission',
'Sutro Heights' : 'Seacliff',
'Ingleside' : 'Ocean View',
'Cow Hollow' : 'Marina',
'Upper Market' : 'Castro/Upper Market',
'Polk Gulch' : 'Nob Hill',
'Produce Market' : 'Bayview',
'Cayuga' : 'Outer Mission',
'Buena Vista' : 'Haight Ashbury',
'St. Francis Wood' : 'West of Twin Peaks',
'Miraloma Park' : 'West of Twin Peaks',
'Lone Mountain' : 'Inner Richmond',
'Merced Heights' : 'Ocean View',
'India Basin' : 'Bayview',
'Japantown' : 'Western Addition',
'Rincon Hill' : 'South of Market',
'South Beach' : 'South of Market',
'Merced Manor' : 'Lakeshore',
'Eureka Valley' : 'Castro/Upper Market',
'Aquatic Park / Ft. Mason' : 'Russian Hill',
'Parnassus Heights' : 'Inner Sunset',
'Oceanview' : 'Ocean View',
'Lower Haight' : 'Western Addition',
'Hunters Point' : 'Bayview',
'West Portal' : 'West of Twin Peaks',
'Presidio National Park' : 'Presidio',
'Westwood Park' : 'West of Twin Peaks',
'Candlestick Point SRA' : 'Bayview',
'Northern Waterfront' : 'North Beach',
'Fairmount' : 'Glen Park',
'Balboa Terrace' : 'West of Twin Peaks',
'Dogpatch' : 'Potrero Hill',
'Forest Knolls' : 'Inner Sunset',
'Ashbury Heights' : 'Haight Ashbury',
'Forest Hill' : 'West of Twin Peaks',
'McLaren Park' : 'Visitacion Valley',
'Dolores Heights' : 'Castro/Upper Market',
'Mission Bay' : 'South of Market',
'Sunnydale' : 'Visitacion Valley',
'Telegraph Hill' : 'North Beach',
'Ingleside Terraces' : 'Ocean View',
"St. Mary's Park" : 'Bernal Heights',
'Presidio Terrace' : 'Presidio Heights',
'Laurel Heights / Jordan Park' : 'Presidio Heights',
'Peralta Heights' : 'Bernal Heights',
'Laguna Honda' : 'Inner Sunset',
'Sherwood Forest' : 'West of Twin Peaks',
'Midtown Terrace' : 'Twin Peaks',
'Lake Street' : 'Inner Richmond',
"Fisherman's Wharf" : 'North Beach',
'Apparel City' : 'Bayview',
'Stonestown' : 'Lakeshore',
'Clarendon Heights' : 'Twin Peaks',
'Anza Vista' : 'Western Addition',
'Lincoln Park / Ft. Miley' : 'Seacliff',
'Little Hollywood' : 'Visitacion Valley',
'Westwood Highlands' : 'West of Twin Peaks',
'Yerba Buena Island' : 'Treasure Island/YBI',
'Treasure Island' : 'Treasure Island/YBI',
'Monterey Heights' : 'West of Twin Peaks',
'8' : 'Outer Richmond'
}
# setup dictionary to standardise Categories into our definition of Main Categories
catagory_mapping = {
"Sidewalk or Curb" : "Defects",
"General Request - PUBLIC WORKS" : "Defects",
"Sewer Issues" : "Defects",
"Street Defects" : "Defects",
"Streetlights" : "Defects",
"Street and Sidewalk Cleaning" : "Trash",
"Litter Receptacles" : "Trash",
"Graffiti" : "Vandalism",
'Damaged Property' : "Vandalism",
"Encampments" : "Loiterer",
"Homeless Concerns" : "Loiterer",
"Parking Enforcement" : "Restricted Mobility",
"Blocked Street or SideWalk" : "Restricted Mobility",
"Abandoned Vehicle" : "Restricted Mobility",
"Noise Report" : "Noise"
}
column_types_311 = {'Category': 'category',
'Focus Category': 'category',
'Opened_DayOfWeek': 'uint8',
'Opened_Hour': 'uint8',
'Opened_HourOfWeek': 'uint8',
'Opened_Month': 'uint8',
'Opened_Year': 'uint16',
'SF Neighborhood': 'category',
'TimeToClose': 'float32'}
column_types_hp = {'Id': 'uint16',
'Neighborhood': 'category',
'Sold Price': 'float32',
'formatted_address': 'category'}
to_weekday = {
0 : "Monday",
1 : "Tuesday",
2 : "Wednesday",
3 : "Thursday",
4 : "Friday",
5 : "Saturday",
6 : "Sunday"
}
def correct_neighborhood(n, mapping):
"""
input: neighborhood and the neighborhood mapping dictionary
output: correct neighborhood according to standard
"""
if n in mapping.keys():
return mapping[n]
return n
def correct_catagory(c, mapping):
"""
input: category and the category mapping dictionary
output: correct category according to standard
"""
return mapping[c]
def create_polygons(path_to_gj):
"""
input: path to geojson file
output: dictionary mapping each neighborhood to its Shapely polygon
"""
polygons = {} #filling a dictionary with polygons
with open(path_to_gj) as f:
gj = geojson.load(f)
for n in gj['features']: # creating a polygon for each neighborhood using the Shapely library
polygons[n['properties']['neighborho']]=Polygon(n['geometry']['coordinates'][0][0])
return polygons
# load the neighborhood polygons once, so repeated lookups do not re-read the geojson file
sf_polygons = create_polygons('data/sf_planning_neighborhoods.geojson')

def point_to_neighborhood(longitude, latitude):
    """
    input: coordinate (longitude, latitude)
    output: the neighborhood containing the coordinate (None if no match)
    """
    point = Point(longitude, latitude) #initiating point using the Shapely library
    for n, polygon in sf_polygons.items(): #looping through each neighborhood
        if point.within(polygon): #checking whether point is within the neighborhood using the Shapely library
            return n
    return None
def read_geojson(path_to_file):
"""
input: path to geojson file
output: geojson formatted to plotly
"""
with open(path_to_file) as f:
gj = geojson.load(f)
return gj
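The fallback logic in `correct_neighborhood` can be expressed more compactly with `dict.get`; a minimal sketch with a toy two-entry mapping (the `standardise` helper here is hypothetical, not part of the notebook):

```python
def standardise(n, mapping):
    # alias if known, otherwise pass the name through unchanged
    return mapping.get(n, n)

# two entries taken from the real neighborhood_mapping above
toy_mapping = {'Cole Valley': 'Haight Ashbury',
               'Tenderloin': 'Downtown/Civic Center'}

print(standardise('Cole Valley', toy_mapping))  # Haight Ashbury
print(standardise('Mission', toy_mapping))      # Mission
```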
We start by loading the data and filtering it to the variables most relevant for our analysis.
#Importing the raw data of 311 service requests
df_311 = pd.read_csv("data/311_Cases.csv", sep=",")
#Filter data to only include relevant columns
df_311 = df_311[['CaseID', 'Opened', 'Closed', 'Updated', 'Status', 'Status Notes',
'Responsible Agency', 'Category', 'Request Type', 'Request Details',
'Address', 'Street', 'Supervisor District', 'Neighborhood',
'Police District', 'Latitude', 'Longitude', 'Source', 'Media URL',
]]
#Remove instances where neighborhood is NaN
df_311 = df_311[df_311['Neighborhood'].notna()]
#Importing the raw data of house prices of San Francisco
df_housePrices = pd.read_csv("data/house_prices.csv", sep=",", index_col=0)
#Filter data to only include relevant columns
df_housePrices = df_housePrices[['Id', 'Address', 'Sold Price', 'City', 'Zip', 'State',
'Region','longitude', 'latitude', 'formatted_address']]
df_housePrices = df_housePrices[df_housePrices['Region'] == "San Francisco"]
#Importing the raw data of the socio-economic neighborhood profiles of San Francisco
df_neighborhood_profiles = pd.read_csv("data/SF_neighborhoods_socio_economic.csv", sep=";")
Let's have a first look at the 311 data.
# print columns, number of rows and number of columns
print("Dataset columns: ")
print(df_311.columns)
print("\n Total rows: " +str(df_311.shape[0]))
print("\n Total columns: " +str(df_311.shape[1]))
df_311.head()
Dataset columns:
Index(['CaseID', 'Opened', 'Closed', 'Updated', 'Status', 'Status Notes',
'Responsible Agency', 'Category', 'Request Type', 'Request Details',
'Address', 'Street', 'Supervisor District', 'Neighborhood',
'Police District', 'Latitude', 'Longitude', 'Source', 'Media URL'],
dtype='object')
Total rows: 4399166
Total columns: 19
| | CaseID | Opened | Closed | Updated | Status | Status Notes | Responsible Agency | Category | Request Type | Request Details | Address | Street | Supervisor District | Neighborhood | Police District | Latitude | Longitude | Source | Media URL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10622276 | 03/19/2019 04:44:32 PM | NaN | 08/28/2020 01:31:06 AM | Open | accepted | DPW BSM Queue | Sidewalk or Curb | Sidewalk_Defect | Collapsed_sidewalk | 361 MISSISSIPPI ST, SAN FRANCISCO, CA, 94107 | MISSISSIPPI ST | 10.0 | Potrero Hill | BAYVIEW | 37.761560 | -122.394172 | Web | NaN |
| 1 | 10704816 | 04/09/2019 07:08:53 AM | NaN | 08/28/2020 01:31:03 AM | Open | accepted | DPW BSM Queue | Sidewalk or Curb | Sidewalk_Defect | Lifted_sidewalk_tree_roots | 1407 GOUGH ST, SAN FRANCISCO, CA, 94109 | GOUGH ST | 5.0 | Cathedral Hill | NORTHERN | 37.786767 | -122.425109 | Mobile/Open311 | NaN |
| 2 | 10892486 | 05/21/2019 04:11:00 PM | 08/28/2020 07:46:00 AM | 08/28/2020 07:46:00 AM | Closed | Case Resolved - Loose PG&E vault doors and sun... | DPW BSM Queue | Sidewalk or Curb | Sidewalk_Defect | Lifted_sidewalk_other | Intersection of CARL ST and COLE ST | CARL ST | 5.0 | Cole Valley | PARK | 37.765800 | -122.449959 | Phone | NaN |
| 3 | 11866528 | 12/27/2019 12:09:00 PM | 01/08/2020 09:30:13 PM | 01/08/2020 09:30:13 PM | Closed | Case Resolved - Per process - cases are closed... | Clear Channel - Transit Queue | Street and Sidewalk Cleaning | Transit_shelter_platform | Transit_shelter_platform | Intersection of MARKET ST and DRUMM ST | MARKET ST | 3.0 | Financial District | CENTRAL | 37.793255 | -122.396301 | Phone | NaN |
| 4 | 10650674 | 03/26/2019 08:26:28 PM | NaN | 07/15/2020 01:25:53 AM | Open | accepted | DPW - Bureau of Street Use and Mapping - G | General Request - PUBLIC WORKS | request_for_service | bsm - request_for_service | 868 FOLSOM ST, SAN FRANCISCO, CA, 94107 | FOLSOM ST | 6.0 | South of Market | SOUTHERN | 37.780903 | -122.402802 | Integrated Agency | NaN |
For our previously defined analysis, the most important columns of the San Francisco dataset are Category and Neighborhood. Let's start by taking a closer look at those two columns.
When a 311 request is created, it is assigned to a specific Category (e.g. Noise Report or Graffiti). Within each category, a request is further assigned a subcategory called Request Type, and on top of that the user can also specify some Request Details. To get an overview of the different requests the data set contains, let's look at how many there are.
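This Category → Request Type hierarchy can be counted with pandas' `nunique`; a toy sketch with made-up requests (the request-type values here are invented for illustration):

```python
import pandas as pd

# a few made-up requests: two categories, three distinct request types
toy_requests = pd.DataFrame({
    'Category':     ['Graffiti', 'Graffiti', 'Noise Report'],
    'Request Type': ['on_wall',  'on_sign',  'loud_music'],
})

print(toy_requests['Category'].nunique())      # 2
print(toy_requests['Request Type'].nunique())  # 3
```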
# print categories, request types and request details
categories = df_311.Category.unique()
requestTypes = df_311['Request Type'].unique()
requestDetails = df_311['Request Details'].unique()
print('Number of categories:', len(categories))
print('Number of unique request types:', len(requestTypes))
print('Number of unique request details:', len(requestDetails))
Number of categories: 101
Number of unique request types: 397
Number of unique request details: 300541
Alright, already at the top level Category we have 101 different categories. To make our visualizations most relevant for the user, and to avoid overwhelming them with detail, we filter the dataset down to a set of the most important categories. We chose categories that are of major concern when deciding on a neighborhood, i.e. issues that are particularly annoying, unpleasant or disturbing.
Let's start by looking at the distribution of the 101 different categories.
# defining a set of focus categories to filter the dataset
focuscategories = ["Sidewalk or Curb", "Street and Sidewalk Cleaning", "General Request - PUBLIC WORKS", "Parking Enforcement", "Graffiti",
"Litter Receptacles", "Encampments", "Noise Report", "Sewer Issues", "Street Defects", "Streetlights", "Blocked Street or SideWalk",
"Abandoned Vehicle", 'Damaged Property', "Homeless Concerns"]
# creating a bar chart just to get an intuition of how frequent the different categories are
fig = plt.figure(figsize = (22,22))
# counting number of requests for each Category
category_counts = df_311.groupby('Category').size()
category_counts = category_counts.sort_values(ascending = True)
# set up bar chart
barplot = plt.barh(category_counts.index, category_counts)
for i in range(len(category_counts)):
if category_counts.index[i] in focuscategories:
barplot[i].set_color('r') # coloring focus categories in a different color
plt.xlabel('category count')
plt.title('Counts for the different 311 case categories in San Francisco (focus categories in red)')
plt.show()
Looking at the distribution, one can see that it is highly right-skewed, with a long tail of many sparse categories. We choose 15 of the top 25 categories as our focus categories; in our judgement, these 15 best describe disturbing behavior. In the plot above, the focus categories are highlighted in red.
#filter dataframe according to the set of focus categories
df_311 = df_311[df_311['Category'].isin(focuscategories)]
print("\n Total rows: " +str(df_311.shape[0]))
Total rows: 3725194
The dataset now contains 3,725,194 service requests, i.e. ~15% of the service requests were removed.
However, the focus categories carry names such as General Request - PUBLIC WORKS or Sidewalk or Curb, which mean little to the reader. To make the data easier for us to visualize and for the reader to understand, we group the focus categories into 6 main categories as shown below.
This grouping was created after a thorough investigation of the images attached to the service requests.
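The same per-element dictionary lookup can also be written with pandas' `Series.map`; a toy sketch using two entries from the mapping above (note that, unlike a plain dictionary lookup, `map` turns unmapped values into NaN rather than raising a KeyError):

```python
import pandas as pd

# two entries from the real category mapping
toy_cat_mapping = {'Graffiti': 'Vandalism', 'Noise Report': 'Noise'}

toy_categories = pd.Series(['Graffiti', 'Noise Report', 'Graffiti'])
# Series.map looks each value up in the dictionary, vectorised
main = toy_categories.map(toy_cat_mapping)
print(main.tolist())  # ['Vandalism', 'Noise', 'Vandalism']
```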
# map categories to our definition of Main Categories
df_311['Focus Category'] = df_311['Category'].apply(lambda x: correct_catagory(x, catagory_mapping))
# creating a bar chart just to get an intuition of how frequent the different categories are
fig = plt.figure(figsize = (22,4))
# count the number of requests for each main category
category_counts = df_311.groupby('Focus Category').size()
category_counts = category_counts.sort_values(ascending = True)
barplot = plt.barh(category_counts.index, category_counts)
plt.xlabel('category count')
plt.title('Final distribution of the defined Focus Categories')
plt.show()
Now let's take a closer look at the other important column, Neighborhood.
When a 311 request is created, it is assigned a location (longitude and latitude). This location is mapped to the official police districts of San Francisco; on top of that, the user can also specify a Neighborhood. To get an overview of the divisions the data set contains, let's look at how many there are.
# print number of police districts and neighborhoods
neighborhoods = df_311.Neighborhood.unique()
pdistricts = df_311['Police District'].unique()
print('Number of police districts:', len(pdistricts))
print('Number of neighborhoods:', len(neighborhoods))
Number of police districts: 12
Number of neighborhoods: 118
Alright, at the top level Police District we have 12 different districts. To make our visualizations most relevant for the user, we want to provide a more detailed division of San Francisco. However, the neighborhoods seem to be typed in manually by the user, as the column includes entries such as NaN, 8 and various misspellings. Therefore, we adopt the official Neighborhood notification boundaries created by the Department of City Planning (1) as our standard and map all neighborhoods to it.
# map the 118 neighborhoods to the standard of 37 SF Planning Neighborhoods
df_311['SF Neighborhood'] = df_311['Neighborhood'].apply(lambda x: correct_neighborhood(x, neighborhood_mapping))
print('Number of neighborhoods:', len(df_311['SF Neighborhood'].unique()))
Number of neighborhoods: 37
Let's take a look at the distribution of the 37 different neighborhoods.
# creating a bar chart just to get an intuition of how frequent the different categories are
fig = plt.figure(figsize = (22,10))
# count number of request for each of the 37 neighborhoods
n_counts = df_311.groupby('SF Neighborhood').size()
n_counts = n_counts.sort_values(ascending = True)
barplot = plt.barh(n_counts.index, n_counts)
plt.xlabel('Count')
plt.title('Distribution of 311 cases by Neighborhoods')
plt.show()
Finally, we transform the timestamps and create the temporal features used in the further analysis.
# convert columns containing time into datetime
# create new column 'Timestamp' with the full datetime
df_311['Opened_Timestamp'] = pd.to_datetime(df_311['Opened'], format='%m/%d/%Y %I:%M:%S %p')
df_311['Closed_Timestamp'] = pd.to_datetime(df_311['Closed'], format='%m/%d/%Y %I:%M:%S %p')
df_311['Updated_Timestamp'] = pd.to_datetime(df_311['Updated'], format='%m/%d/%Y %I:%M:%S %p')
# create new columns with temporal features derived from the opening timestamp
df_311['Opened_Year'] = df_311['Opened_Timestamp'].dt.year
df_311['Opened_Month'] = df_311['Opened_Timestamp'].dt.month
df_311['Opened_DayOfWeek'] = df_311['Opened_Timestamp'].dt.weekday
df_311['Opened_Weekday'] = df_311['Opened_Timestamp'].dt.day_name()
df_311['Opened_Hour'] = df_311['Opened_Timestamp'].dt.hour
df_311['Opened_HourOfWeek'] = df_311['Opened_Hour'] + df_311['Opened_Timestamp'].dt.dayofweek * 24
# create a new column containing the time to close, in days
df_311.loc[:,'TimeToClose'] = (df_311['Closed_Timestamp'] - df_311['Opened_Timestamp']).dt.total_seconds()/60/60/24 # seconds -> days
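The derived features can be spot-checked with plain `datetime` arithmetic: a case opened on a Tuesday at 15:00 should get hour-of-week 1*24 + 15 = 39, and one closed 36 hours later should get a TimeToClose of 1.5 days. A toy check:

```python
from datetime import datetime

toy_opened = datetime(2020, 1, 7, 15, 0)  # a Tuesday (weekday() == 1)
toy_closed = datetime(2020, 1, 9, 3, 0)   # 36 hours later

# same formulas as used on the dataframe above
hour_of_week = toy_opened.hour + toy_opened.weekday() * 24
time_to_close = (toy_closed - toy_opened).total_seconds() / 60 / 60 / 24  # days

print(hour_of_week)   # 39
print(time_to_close)  # 1.5
```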
In order to join the San Francisco House Prices dataset with the San Francisco 311 dataset, we had to come up with a shared key. Since our analysis focuses on the neighborhoods of San Francisco, the neighborhood was the obvious choice. However, the House Prices dataset only includes an address and a zipcode, and unfortunately some neighborhoods share the same zipcode. Thus, we utilised the Google Maps API to extract the longitude and latitude of each sold house based on its address. This lets us decide which neighborhood a house belongs to by matching its geographical location against each neighborhood's polygon in the .geojson file.
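Under the hood, Shapely's `point.within(polygon)` answers a point-in-polygon query; a minimal pure-Python ray-casting sketch of the same idea, on a toy square rather than a real neighborhood boundary:

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how often a horizontal ray from (x, y) crosses an edge."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:       # crossing is to the right of the point
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon(2, 2, square))  # True
print(point_in_polygon(5, 2, square))  # False
```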
Warning: Running the below code takes a considerable amount of time!
# map house prices dataset to same standard of 37 neighborhoods based on longitude and latitude
df_housePrices['Neighborhood'] = df_housePrices.apply(lambda row: point_to_neighborhood(row['longitude'],row['latitude']), axis=1)
df_housePrices.head(5)
| | Id | Address | Sold Price | City | Zip | State | Region | longitude | latitude | formatted_address | Neighborhood |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 8 | 225 26th Ave #2 | 1590000.0 | San Francisco | 94121 | CA | San Francisco | -122.486445 | 37.785155 | 225 26th Ave, San Francisco, CA 94121, USA | Outer Richmond |
| 12 | 12 | 2425 21st Ave | 2100000.0 | San Francisco | 94116 | CA | San Francisco | -122.478103 | 37.742442 | 2425 21st Ave, San Francisco, CA 94116, USA | Parkside |
| 15 | 15 | 855 Folsom St APT 903 | 1155500.0 | San Francisco | 94107 | CA | San Francisco | -122.402142 | 37.780767 | 855 Folsom St APT 903, San Francisco, CA 94107... | South of Market |
| 21 | 21 | 1779 10th Ave | 1380000.0 | San Francisco | 94122 | CA | San Francisco | -122.467047 | 37.755080 | 1779 10th Ave, San Francisco, CA 94122, USA | Inner Sunset |
| 38 | 38 | 1340 Leavenworth St | 652000.0 | San Francisco | 94109 | CA | San Francisco | -122.415971 | 37.793506 | 1340 Leavenworth St, San Francisco, CA 94109, USA | Nob Hill |
Thus, we manage to map all ~10,000 sold houses to 36 unique neighborhoods; the only one not represented is Golden Gate Park which, as the name suggests, is a park with few residents.
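With both datasets now carrying the same neighborhood standard, they can be joined on that key; a toy `pandas.merge` sketch with made-up numbers:

```python
import pandas as pd

# made-up per-neighborhood aggregates from the two datasets
reqs = pd.DataFrame({'Neighborhood': ['Mission', 'Nob Hill'],
                     'Number of requests': [1200, 800]})
prices_df = pd.DataFrame({'Neighborhood': ['Mission', 'Nob Hill'],
                          'Median Sold Price': [1_250_000, 1_400_000]})

# join on the shared neighborhood key
combined = reqs.merge(prices_df, on='Neighborhood', how='inner')
print(combined.shape)  # (2, 3)
```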
Since the dataset on San Francisco Neighborhood Socio-Economic Profiles was transformed into a CSV manually, no further pre-processing was needed.
The first part of the exploratory analysis will focus on spatial patterns.
First, let's have a look at the location of the different neighborhoods. We will use a Folium choropleth for this purpose.
# group by neighborhood and count number of requests
sf = df_311.groupby('SF Neighborhood').count()
sf = pd.DataFrame(sf,columns=['Category']) # remove unneeded columns
sf.reset_index(inplace=True) # default index, otherwise groupby column becomes index
sf.rename(columns={'SF Neighborhood':'Neighborhood','Category':'Number of request'}, inplace=True)
sf.sort_values(by='Number of request', inplace=True, ascending=False)
# San Francisco latitude and longitude values
latitude = 37.77
longitude = -122.42
sf_neighborhood_geo = 'data/sf_planning_neighborhoods.geojson'
# Create map and zoom in on San Francisco
sf_map = folium.Map(
location=[latitude,longitude],
zoom_start=12)
folium.Choropleth(
geo_data=sf_neighborhood_geo,
name="choropleth",
data=sf,
columns=['Neighborhood','Number of request'],
key_on='feature.properties.neighborho', #define key in geojson to map neighborhoods to choropleths
fill_color='YlOrRd',
fill_opacity=0.7,
line_opacity=0.2,
).add_to(sf_map)
# add choropleths to the San Francisco map
folium.LayerControl().add_to(sf_map)
# display the map
sf_map
The plot gave us insights on one important matter
In the next step we want to visualize how the different categories are distributed across space. We will use Folium and plot the cases on a map using their latitude/longitude coordinates. Because rendering many markers in Folium is computationally heavy, we will only plot two categories here and restrict the time period to the year 2020. We also pick two categories that have approximately the same number of requests in this period and for which we expect different spatial patterns: Sewer Issues (associated with calmer residential areas) and Litter Receptacles (associated with more crowded areas).
# define categories to plot
categories_to_plot = ['Sewer Issues', 'Litter Receptacles']
colors = ['blue','red','green','yellow']
# plotting data as a point scatter plot
for j, category in enumerate(categories_to_plot):
print('Spatial distribution for', category)
# creating the map with zoom on San Francisco
map_scatter = folium.Map(#location=[37.7749, -122.4194], # town hall
location=[37.7709, -122.4394], # customized
#location= [37.76880514149926, -122.44645648345819], # twin peaks
zoom_start = 13, #13 if no width and height specified
#tiles = 'Stamen Toner',
height = 600,
width = 1000,
)
data_scatter = df_311[(df_311.Category == category) &
(df_311.Opened_Year == 2020) ]
print('Number of observations:', len(data_scatter))
for i in data_scatter.index:
folium.CircleMarker(location = [data_scatter['Latitude'][i],data_scatter['Longitude'][i]], radius = 0.01, color = colors[j]).add_to(map_scatter)
display(map_scatter)
Spatial distribution for Sewer Issues
Number of observations: 8860
Spatial distribution for Litter Receptacles
Number of observations: 8267